In general, prosody refers to the variation over time of speech elements such as pitch, energy (loudness) and duration. As used herein, “pitch” refers to fundamental frequency (F0). Prosodic components provide a great deal of information in speech. For example, varying duration of pauses between some words or sounds can impart different meanings to those words. Changing the pitch at which certain parts of a word are spoken can change the context of that word and/or indicate excitement or other emotion of the speaker. Variations in loudness can have similar effects. In addition to conveying meaning, prosodic components strongly influence the identity associated with a particular speaker's voice. Unpublished research by the present inventors has shown that people are able to recognize speaker identity based on pure prosodic stimuli (i.e., “beep”-like sounds that were generated using a single sinusoid that followed the evolvement of pitch, energy and durations in recorded speech).
Because prosodic components are important to speaker identification, it is advantageous to modify one or more of these components when performing voice conversion. In general, “voice conversion” refers to techniques for modifying the voice of a first (or source) speaker to sound as though it were the voice of a second (or target) speaker. Existing voice conversion techniques have difficulty converting the prosody of a voice. In many such techniques, the converted speech prosody closely follows the prosody of the source, and only the mean and variance of pitch are altered. Although other techniques have been studied, there remains a need for solutions with better performance.